Walmart Store Sales Forecasting

Use historical markdown data to predict store sales

Importing Libraries

Loading Datasets

We can see that the test dataset doesn't contain the features included in the train dataset. Since these features (Temperature, Fuel_Price, the MarkDowns, CPI and Unemployment) depend strongly on the date, they cannot be used on the test dataset, so it would be a good idea to drop them. Before doing that, however, we will make sure that these features don't provide any information about the target 'Weekly_Sales'.
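A quick way to confirm which features are missing from the test set is to compare the column sets of the two frames directly. A minimal sketch, with toy frames standing in for the loaded data (the real notebook reads `train` and `test` from CSV files):

```python
import pandas as pd

# Toy frames standing in for the loaded data (column lists only)
train = pd.DataFrame(columns=["Store", "Dept", "Date", "Weekly_Sales",
                              "Temperature", "Fuel_Price", "CPI", "Unemployment"])
test = pd.DataFrame(columns=["Store", "Dept", "Date"])

# Columns present in train but absent from test
extra_cols = sorted(set(train.columns) - set(test.columns))
print(extra_cols)
```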

Prepare Dataset for Training

Data Cleaning

Let's start by cleaning both datasets. We will check for missing values and duplicates, and eliminate them if necessary.
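The checks described above boil down to a few pandas calls. A sketch on a toy frame (one duplicated row, one missing value) standing in for the real data:

```python
import pandas as pd

# Toy frame with one duplicated row and one missing value (stand-in for train)
df = pd.DataFrame({"Store": [1, 1, 2], "Weekly_Sales": [100.0, 100.0, None]})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

# Drop exact duplicates, then rows that still contain NaNs
df = df.drop_duplicates().dropna()
```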

It is important to keep in mind that the two datasets are going to be merged. They must therefore share a key column with matching values, so we will also verify that the values of that column are consistent across both datasets.

Another useful step is to make the 'Date' attribute easier to work with by splitting it into its components (i.e. Year, Month, Week and Day).
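The split can be done with the `.dt` accessor after parsing the column with `pd.to_datetime`. A sketch on two sample dates:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2010-02-05", "2012-10-26"]})
df["Date"] = pd.to_datetime(df["Date"])

# Split the date into its components for easier grouping later
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Week"] = df["Date"].dt.isocalendar().week
df["Day"] = df["Date"].dt.day
```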

Descriptive statistics & Visualizations

There are missing values for the MarkDown columns as mentioned previously. Approximately 70% of values are missing in each MarkDown column.
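The missing ratio per column can be computed with `isna().mean()`. A sketch on a toy column with the same ~70% missing rate reported above:

```python
import numpy as np
import pandas as pd

# Toy column: 7 of 10 values missing, mirroring the ratio seen in train
df = pd.DataFrame({"MarkDown1": [np.nan] * 7 + [1.0, 2.0, 3.0]})

missing_pct = df.isna().mean() * 100  # percentage of missing values per column
print(missing_pct)
```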

Weekly sales

The plot makes the right skewness clear: most weeks have sales near the median, with a long tail extending to the right. The Weekly_Sales attribute also has a large kurtosis, which indicates the presence of extreme values; in other words, some weeks have very high sales. It would be worth investigating the origin of these extreme values.
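Both statements can be checked numerically with pandas' `.skew()` and `.kurtosis()`. A sketch on a toy right-skewed series (many moderate weeks plus a few extreme ones):

```python
import pandas as pd

# Toy sales series: mostly moderate values with a heavy right tail
sales = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 90, 120])

print(sales.skew())      # > 0 confirms right skew
print(sales.kurtosis())  # large positive value signals heavy tails / extreme weeks
```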

Inference :


Finding Correlation

Here the target variable Weekly_Sales is numerical, so we can use the Pearson correlation coefficient or Spearman's rank correlation to measure the correlation between the features and the target.
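Pandas exposes both through `Series.corr(method=...)`. The difference matters when the relation is monotonic but non-linear, as the toy example below shows: Spearman (rank-based) reports a perfect association while Pearson (linear) falls just short of 1.

```python
import pandas as pd

# Toy data: a monotonic but non-linear relation to the target
df = pd.DataFrame({"feature": [1, 2, 3, 4, 5],
                   "Weekly_Sales": [1, 4, 9, 16, 25]})

pearson = df["feature"].corr(df["Weekly_Sales"], method="pearson")
spearman = df["feature"].corr(df["Weekly_Sales"], method="spearman")
print(pearson, spearman)
```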

Correlation of Numerical Features with Weekly_Sales Using a Heatmap

Inference:



Impute Numerical Data


# As mentioned previously, the MarkDown features have no impact on the target, so we can drop them
train = train.drop(['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], axis=1)
test = test.drop(['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], axis=1)

Pandas Profiling

Analysing target variable

Analysing Size of the store for each Type

Inference:

Analysing Weekly sales based on Size of the Store

Inference:

Analysing Weekly_Sales per Store

Inference:

Analysing Weekly_Sales per Type

The Type feature appears to be ordinal with respect to Weekly_Sales, so we can encode it taking this ordering into account.
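One way to do this is a simple ordinal mapping instead of one-hot encoding. The A > B > C ordering below is an assumption based on the per-type sales plots above (type A stores show the highest sales):

```python
import pandas as pd

df = pd.DataFrame({"Type": ["A", "B", "C", "A"]})

# Ordinal encoding: assume A > B > C in terms of typical Weekly_Sales,
# as suggested by the per-type analysis
type_order = {"A": 3, "B": 2, "C": 1}
df["Type"] = df["Type"].map(type_order)
print(df["Type"].tolist())
```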

Time series features

Features like Temperature, Fuel_Price, CPI, and Unemployment do not have any linear correlation with Weekly_Sales as indicated by the correlation matrix. We plot their time series to examine further.

Inference:

  1. The economy is likely experiencing growth over this period.
  2. CPI increases slightly over time, implying inflation.
  3. Fuel_Price rises as well.
  4. Unemployment, on the other hand, decreases. Together these effects are consistent with a strengthening economy.
  5. For Weekly_Sales, we see peaks before the end of 2010 and 2011.
  6. However, the cyclical behavior of Temperature does not seem to correlate with that of Weekly_Sales.

There do not seem to be any noticeable patterns in the other features that correspond to Weekly_Sales either.

Evaluate Algorithms

Training and Testing Set
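A sketch of an 80/20 split using pandas only (names are illustrative). For a forecasting problem a chronological split is often preferable, but a random split matches a simple model-comparison setup:

```python
import pandas as pd

# Toy frame standing in for the prepared training data
df = pd.DataFrame({"x": range(10), "Weekly_Sales": range(10)})

# 80/20 random split with a fixed seed for reproducibility
train_part = df.sample(frac=0.8, random_state=42)
valid_part = df.drop(train_part.index)
```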

Normalising Data
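Min-max normalisation can be written directly in pandas (scikit-learn's `MinMaxScaler` does the same thing). Note that in a real pipeline the min and max should be taken from the training split only, to avoid leakage:

```python
import pandas as pd

# Toy numeric column standing in for a feature like Size
df = pd.DataFrame({"Size": [50000, 100000, 200000]})

# Min-max scaling to the [0, 1] range, column-wise
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```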

Feature importance
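One common way to get feature importances is the impurity-based scores of a fitted random forest. A sketch on synthetic data where, by construction, only the first feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # stand-ins for e.g. Size, Store, Dept
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only the first feature matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(model.feature_importances_)  # impurity-based importances, sum to 1
```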

Model Training

Linear Regression

SVR

RidgeCV

ElasticNet

Random Forest

XGBoost

Inference:

Random Forest performs best compared to the other models.